Summary

Kaggle describes this competition as: Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

Libraries & Data

library(skimr) #better data overview
## 
## Attaching package: 'skimr'
## The following object is masked from 'package:stats':
## 
##     filter
library(scales) #moneys labels
library(plyr) #data manipulation
library(tidyverse) #graphing, piping, csv reading
## ── Attaching packages ────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::arrange()    masks plyr::arrange()
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::compact()    masks plyr::compact()
## ✖ dplyr::count()      masks plyr::count()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::failwith()   masks plyr::failwith()
## ✖ dplyr::filter()     masks skimr::filter(), stats::filter()
## ✖ dplyr::id()         masks plyr::id()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ dplyr::mutate()     masks plyr::mutate()
## ✖ dplyr::rename()     masks plyr::rename()
## ✖ dplyr::summarise()  masks plyr::summarise()
## ✖ dplyr::summarize()  masks plyr::summarize()
library(ggforce) # paginated facet wraps
library(caret) # model training
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(arm) #centering, rescaling vars
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## Loading required package: lme4
## 
## arm (Version 1.10-1, built: 2018-4-12)
## Working directory is /Users/meag/Desktop/IS 6489/Housing - NonCurropt
## 
## Attaching package: 'arm'
## The following object is masked from 'package:scales':
## 
##     rescale
library(corrplot) #plotting correlation
## corrplot 0.84 loaded
## 
## Attaching package: 'corrplot'
## The following object is masked from 'package:arm':
## 
##     corrplot
# Import test and training files
train <- read.csv("train.csv")
test <- read.csv("test.csv")

# Combine data for cleaning/exploration
test_labels <- test$Id
test$SalePrice <- NA
combined <- rbind(train, test)

Data Cleaning

I started off by filling in missing values in both the training and test sets. For variables which are physical features, such as whether a house has an Alley, I imputed a No Alley category, assuming they simply did not have the feature.

For descriptive fields of house features, such as the square footage of a basement, I entered a 0 when it appeared the feature did not exist. There were 81 observations, for example, which had the majority of the basement variables as NAs.

For other variables which every house should have, like the type of sale completed, I imputed the median for numeric variables or mode for categorical variables. There were a few variables I gave special treatment to - I imputed the year the house was built for the missing values which had a garage, but not a value for the year the garage was built.

There were a number of categorical variables which seemed to have an order, which I ordered as appropriate, such as “Gentle, Moderate, Severe” for the variable describing whether the land was sloped or flat.

I created a quality vector which I assign to each quality variables. I use these to create a quality super variable later on.

Quality Vector for new variable

quality_null <- 3
Qualities <- c("NA" = quality_null, 'None' = quality_null, 'Po' = 1, 'Fa' = 2, 'TA' = 7, 'Gd' = 13, 'Ex' = 21)
Quality_vector <- c("NA", "Po", "Fa", "TA", "Gd", "Ex")

Exterior

# Re-factor
combined <- combined %>% 
  mutate(
    ExterQual = ordered(ExterQual, levels=Quality_vector), 
    ExterQual_num = as.integer(revalue(ExterQual, Qualities)),
    ExterCond = ordered(ExterCond, levels=Quality_vector),
    SaleType = factor(SaleType),
    MSSubClass = factor(MSSubClass)
  )
## The following `from` values were not present in `x`: None
# Impute NAs
combined <- combined %>% 
  mutate(
    MasVnrArea = ifelse(is.na(MasVnrArea),0,MasVnrArea),
    MasVnrType = fct_explicit_na(MasVnrType, na_level = "None"), #mode, also probably doesnt have brick
    ExterQual_num = ifelse(is.na(ExterQual_num),quality_null,ExterQual_num),
    ExterCond = ordered(ExterCond, levels=Quality_vector),
    Exterior1st = fct_explicit_na(Exterior1st, na_level = "VinylSd"), #mode
    Exterior2nd = fct_explicit_na(Exterior2nd, na_level = "VinylSd"), #mode
    SaleType = fct_explicit_na(SaleType, na_level = "WD"), #mode
)
summary(combined$MasVnrArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0   101.4   163.5  1600.0

Lot

# Re-factor
combined <- combined %>% 
  mutate(
    LotShape = ordered(LotShape, levels=c("Reg", "IR1", "IR2", "IR3")),
    LandSlope = ordered(LandSlope, levels=c("Gtl", "Mod", "Sev")),
  )

# Impute NAs
combined <- combined %>% 
  mutate(
    MSZoning = fct_explicit_na(factor(MSZoning), na_level = "RM"),
    Alley = fct_explicit_na(factor(Alley), na_level = "No Alley"),
    LotFrontage = ifelse(is.na(LotFrontage),0,LotFrontage),
)

Comfort amenities

# Re-factor

combined <- combined %>% 
  mutate(
    KitchenQual = ordered(KitchenQual, levels=Quality_vector),
    KitchenQual_num = as.integer(revalue(KitchenQual, Qualities)),
    FireplaceQu = ordered(FireplaceQu, levels=Quality_vector),
    PoolQC = ordered(PoolQC,  levels=Quality_vector),

)
## The following `from` values were not present in `x`: None
# Impute NAs
combined <- combined %>% 
  mutate(
    KitchenQual = fct_explicit_na(KitchenQual, na_level = "TA"), # setting to mode
    KitchenQual_num = ifelse(is.na(KitchenQual_num),quality_null,KitchenQual_num),
    FireplaceQu = fct_explicit_na(FireplaceQu, na_level = "NA"),
    MiscFeature = fct_explicit_na(MiscFeature, na_level = "NA"),
    Fence = fct_explicit_na(factor(Fence), na_level = "NA"),
    PoolQC = fct_explicit_na(PoolQC, na_level = "NA"),
    Functional = fct_explicit_na(Functional, na_level = "Typ"), # setting to mode
)

Basement

# Re-factor

combined <- combined %>% 
  mutate(
    BsmtQual = ordered(BsmtQual, levels=Quality_vector), 
    BsmtQual_num = as.integer(revalue(BsmtQual, Qualities)),
    BsmtCond = ordered(BsmtCond,  levels=Quality_vector), 
    BsmtExposure = ordered(BsmtExposure, levels=c("No", "Mn", "Av", "Gd")),
    BsmtFinType1 = ordered(BsmtFinType1,  levels=c("Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ")),
    BsmtFinType2 = ordered(BsmtFinType2, levels=c("Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ")),

)
## The following `from` values were not present in `x`: None
# Impute NAs
combined <- combined %>% 
  mutate(
    BsmtQual = fct_explicit_na(BsmtQual, na_level = "NA"),
    BsmtQual_num = ifelse(is.na(BsmtQual_num),quality_null,BsmtQual_num),
    BsmtCond = fct_explicit_na(BsmtCond, na_level = "NA"),
    BsmtExposure = fct_explicit_na(BsmtExposure, na_level = "NA"),
    BsmtFinType1 = fct_explicit_na(BsmtFinType1, na_level = "NA"),
    BsmtFinType2 = fct_explicit_na(BsmtFinType1, na_level = "NA"),
    BsmtUnfSF = ifelse(is.na(BsmtUnfSF),0,BsmtUnfSF),
    TotalBsmtSF = ifelse(is.na(TotalBsmtSF),0,TotalBsmtSF),
    BsmtFullBath = ifelse(is.na(BsmtFullBath),0,BsmtFullBath),
    BsmtHalfBath = ifelse(is.na(BsmtHalfBath),0,BsmtHalfBath),
    BsmtFinSF1 = ifelse(is.na(BsmtFinSF1),0,BsmtFinSF1),
    BsmtFinSF2 = ifelse(is.na(BsmtFinSF2),0,BsmtFinSF2)
  )

Utilities

# Re-factor

combined <- combined %>% 
  mutate(
    HeatingQC = ordered(HeatingQC, levels=Quality_vector),
)

# Impute NAs
combined <- combined %>% 
  mutate(
    Electrical = fct_explicit_na(factor(Electrical), na_level = "NA"),
    Utilities = fct_explicit_na(factor(Utilities), na_level = "AllPub"),
)

Garage

# Re-factor

combined <- combined %>% 
  mutate(
    GarageQual = ordered(GarageQual,  levels=Quality_vector),
    GarageQual_num = as.integer(revalue(GarageQual, Qualities)),
    GarageCond = ordered(GarageCond, levels=Quality_vector),
    GarageFinish = ordered(GarageFinish, levels=c("Unf", "RFn", "Fin")),
)
## The following `from` values were not present in `x`: None
# Impute NAs
combined <- combined %>% 
  mutate(
    GarageQual = fct_explicit_na(GarageQual, na_level = "NA"),
    GarageQual_num = ifelse(is.na(GarageQual_num),quality_null,GarageQual_num),
    GarageCond = fct_explicit_na(GarageCond, na_level = "NA"),
    GarageType = fct_explicit_na(GarageType, na_level = "NA"),
    GarageFinish = fct_explicit_na(GarageFinish, na_level = "NA"),
    GarageCars = ifelse(is.na(GarageCars),0,GarageCars),
    GarageArea = ifelse(is.na(GarageArea),0,GarageArea),
    GarageYrBlt = ifelse(is.na(GarageYrBlt), YearBuilt, GarageYrBlt),
)

Feature Engineering

I created 9 variables where I thought additional value could be gained by combining existing variables. I describe each of these below.

Total Quality - A numeric variable alternative to the overall quality variable provided. There are a series of quality variables, like fireplace quality or kitchen quality, which had an obvious order to them. For each of these variables, I created a new variable which I mapped to a numeric value given the scale listed in data cleaning.

Notice poor and fair quality is weighted lower than if the house didn’t have the feature at all. On the opposite side of the spectrum, Excellent has a mis-proportionally higher weighting. I then used these series of numeric variables to create a new ‘Overall Quality’ variable, which I made by multiplying the following together: exterior quality, kitchen quality, basement quality, garage quality, and overall quality. Having just a nice kitchen is not the same as also having a nice exterior, basement, and garage.

Total Squarefeet - A numeric variable accounting for total square footage, from both the basement and above ground levels.

Total Baths - A numeric variable summing of the bath variables: full baths and half baths from the above ground levels and the basement levels. Half baths are counted as .5.

Total Rooms - A numeric variable summing all of the room variables: total rooms above ground, total baths, and total cars in the garage.

Room to Bath ratio - A numeric variable dividing the number of bathrooms by the number of rooms.

House age - A numeric variable subtracting house’s year built from year of sale.

Remodeled age - A numeric variable subtracting the house’s last remodeled age from year of sale.

Yard - A numeric variable subtracting the square footage of the first floor from the total lot area.

Total porch area - A numeric variable summing the square footage of all the different types of porches: wood porches, open porches, enclosed porches, 3-season porches, and screened porches.

combined <- combined %>% 
  mutate(
    total_baths = (FullBath + BsmtFullBath + HalfBath*.5 + BsmtHalfBath*.5), # Add all baths togehter
    total_rooms = total_baths + GarageCars + TotRmsAbvGrd, # Add all rooms togeather
    room_bath_ratio = total_baths/total_rooms, # ration of bathrooms to bedrooms
    house_age = YrSold - YearBuilt, # a b
    remodeled = ifelse(YearBuilt==YearRemodAdd,0,1),
    isNew = ifelse(YrSold==YearBuilt,1,0),
    yard = LotArea - X1stFlrSF,
    total_sf = TotalBsmtSF + GrLivArea, 
    total_quality = ExterQual_num * KitchenQual_num * BsmtQual_num * OverallQual * GarageQual_num,
    total_porch_area = WoodDeckSF + OpenPorchSF + EnclosedPorch + X3SsnPorch + ScreenPorch,
  )

Further EDA

A quick correlation plot of all of the numeric variables showed Overall Quality had a .79 correlation with SalePrice. The next more important variables were above grade living square footage and number of cars the garage had. I made note of these for creating new variables. I saw some colinearity between a few variables - the square footage of the garage and the number of cars a garage could hold was a strong example. Another one is the year the house was built and the year the garage was built.

Additionally, I plotted all the numeric variables in bivariate plots against SalePrice to check for linearity. Six variables demonstrated a non-linear relationship. I discuss these below.

Of the categorical variables, neighborhoods had one of the highest variances when plotted against Sales Price. I use this later for interactions during model training.

Neighborhoods are a good variable to play with

neigh_summary <- combined %>% 
  filter(!is.na(SalePrice)) %>% 
  group_by(Neighborhood) %>% 
  summarize(
    mean = mean(SalePrice),
    q25 = quantile(SalePrice, .25),
    q50 = median(SalePrice),
    q75 = quantile(SalePrice, .75),
    sd = sd(SalePrice)
  ) %>% 
  arrange(desc(q50))

combined %>% 
  filter(!is.na(SalePrice)) %>% 
  group_by(Neighborhood) %>% 
  ggplot(aes(reorder(Neighborhood, SalePrice), SalePrice)) + 
  geom_boxplot() + 
  scale_y_continuous(labels = dollar) + 
  ggtitle("Neighborhood ~ Sale Price") + 
  xlab("Neighborhood") + 
  coord_flip()

combined %>% 
  filter(!is.na(SalePrice)) %>% 
  group_by(Neighborhood, OverallQual) %>% 
  ggplot(aes(OverallQual, reorder(Neighborhood, OverallQual))) + 
  geom_jitter()

Quality and square footage have high correlation

# Create pairwise corr object
vars_numeric <- which(sapply(combined, is.numeric))
vars_factor <- which(sapply(combined, is.factor))
all_nums <- combined[, vars_numeric]
all_nums <- all_nums %>% 
  filter(SalePrice != 0)
corr_num <- cor(all_nums, use="pairwise.complete.obs") 

# sort by correlation with sales price and plot
cor_sorted <- as.matrix(sort(corr_num[,'SalePrice'], decreasing = TRUE))
cor_high <- names(which(apply(cor_sorted, 1, function(x) abs(x)>.5)))
corr_num <- corr_num[cor_high, cor_high]

corrplot.mixed(corr_num, tl.col="black", tl.pos="lt", tl.cex = .8, cl.cex = .8, pch.cex = 1,  addCoefasPercent=TRUE)

corr_num
##                  SalePrice total_quality OverallQual   total_sf
## SalePrice        1.0000000     0.8315077   0.7909816  0.7789588
## total_quality    0.8315077     1.0000000   0.8796764  0.6550321
## OverallQual      0.7909816     0.8796764   1.0000000  0.6648303
## total_sf         0.7789588     0.6550321   0.6648303  1.0000000
## total_rooms      0.7259413     0.5948749   0.6181136  0.7765711
## GrLivArea        0.7086245     0.5391394   0.5930074  0.8803240
## ExterQual_num    0.6826392     0.8698183   0.7262785  0.5288352
## KitchenQual_num  0.6595997     0.8094790   0.6733308  0.4987799
## BsmtQual_num     0.6489383     0.7891200   0.6778120  0.5134487
## GarageCars       0.6404092     0.6255874   0.6006707  0.5296076
## total_baths      0.6317311     0.5463351   0.5410628  0.6005093
## GarageArea       0.6234314     0.6049895   0.5620218  0.5584659
## TotalBsmtSF      0.6135806     0.5839370   0.5378085  0.8228884
## X1stFlrSF        0.6058522     0.5082720   0.4762238  0.7976783
## FullBath         0.5606638     0.5252135   0.5505997  0.5744031
## TotRmsAbvGrd     0.5337232     0.3771657   0.4274523  0.6788025
## YearBuilt        0.5228973     0.6463589   0.5723228  0.3354884
## GarageYrBlt      0.5080433     0.6375334   0.5550219  0.3344535
## YearRemodAdd     0.5071010     0.6028415   0.5506839  0.3384040
## house_age       -0.5233504    -0.6455634  -0.5726295 -0.3363377
##                 total_rooms  GrLivArea ExterQual_num KitchenQual_num
## SalePrice         0.7259413  0.7086245     0.6826392       0.6595997
## total_quality     0.5948749  0.5391394     0.8698183       0.8094790
## OverallQual       0.6181136  0.5930074     0.7262785       0.6733308
## total_sf          0.7765711  0.8803240     0.5288352       0.4987799
## total_rooms       1.0000000  0.8505053     0.4929052       0.4780333
## GrLivArea         0.8505053  1.0000000     0.4359861       0.4205627
## ExterQual_num     0.4929052  0.4359861     1.0000000       0.7161222
## KitchenQual_num   0.4780333  0.4205627     0.7161222       1.0000000
## BsmtQual_num      0.4586546  0.3707950     0.6398382       0.5705734
## GarageCars        0.6761805  0.4672474     0.5263902       0.5098096
## total_baths       0.7466882  0.5951685     0.4752771       0.4649875
## GarageArea        0.6159947  0.4689975     0.4957776       0.4896056
## TotalBsmtSF       0.4391534  0.4548682     0.4705796       0.4326296
## X1stFlrSF         0.5125309  0.5660240     0.3978354       0.3870273
## FullBath          0.7084845  0.6300116     0.4839494       0.4346480
## TotRmsAbvGrd      0.8896204  0.8254894     0.2979028       0.2872785
## YearBuilt         0.3818722  0.1990097     0.5981605       0.5301653
## GarageYrBlt       0.4176555  0.2353869     0.5833311       0.5352806
## YearRemodAdd      0.3839960  0.2873885     0.5873184       0.6253161
## house_age        -0.3824784 -0.2003025    -0.5975223      -0.5292474
##                 BsmtQual_num GarageCars total_baths GarageArea TotalBsmtSF
## SalePrice          0.6489383  0.6404092   0.6317311  0.6234314   0.6135806
## total_quality      0.7891200  0.6255874   0.5463351  0.6049895   0.5839370
## OverallQual        0.6778120  0.6006707   0.5410628  0.5620218   0.5378085
## total_sf           0.5134487  0.5296076   0.6005093  0.5584659   0.8228884
## total_rooms        0.4586546  0.6761805   0.7466882  0.6159947   0.4391534
## GrLivArea          0.3707950  0.4672474   0.5951685  0.4689975   0.4548682
## ExterQual_num      0.6398382  0.5263902   0.4752771  0.4957776   0.4705796
## KitchenQual_num    0.5705734  0.5098096   0.4649875  0.4896056   0.4326296
## BsmtQual_num       1.0000000  0.5153345   0.5336268  0.4596532   0.5197796
## GarageCars         0.5153345  1.0000000   0.4835776  0.8824754   0.4345848
## total_baths        0.5336268  0.4835776   1.0000000  0.4515670   0.4144677
## GarageArea         0.4596532  0.8824754   0.4515670  1.0000000   0.4866655
## TotalBsmtSF        0.5197796  0.4345848   0.4144677  0.4866655   1.0000000
## X1stFlrSF          0.3487472  0.4393168   0.3905573  0.4897817   0.8195300
## FullBath           0.4614966  0.4696720   0.6941971  0.4056562   0.3237224
## TotRmsAbvGrd       0.2213151  0.3622886   0.4602733  0.3378221   0.2855726
## YearBuilt          0.7030818  0.5378501   0.5242983  0.4789538   0.3914520
## GarageYrBlt        0.6506392  0.6198969   0.4747729  0.6028032   0.3459364
## YearRemodAdd       0.5699785  0.4206222   0.4437190  0.3715998   0.2910656
## house_age         -0.7025787 -0.5387274  -0.5225950 -0.4794077  -0.3914978
##                  X1stFlrSF   FullBath TotRmsAbvGrd   YearBuilt GarageYrBlt
## SalePrice        0.6058522  0.5606638   0.53372316  0.52289733   0.5080433
## total_quality    0.5082720  0.5252135   0.37716566  0.64635888   0.6375334
## OverallQual      0.4762238  0.5505997   0.42745234  0.57232277   0.5550219
## total_sf         0.7976783  0.5744031   0.67880245  0.33548845   0.3344535
## total_rooms      0.5125309  0.7084845   0.88962045  0.38187220   0.4176555
## GrLivArea        0.5660240  0.6300116   0.82548937  0.19900971   0.2353869
## ExterQual_num    0.3978354  0.4839494   0.29790278  0.59816048   0.5833311
## KitchenQual_num  0.3870273  0.4346480   0.28727854  0.53016534   0.5352806
## BsmtQual_num     0.3487472  0.4614966   0.22131509  0.70308181   0.6506392
## GarageCars       0.4393168  0.4696720   0.36228857  0.53785009   0.6198969
## total_baths      0.3905573  0.6941971   0.46027330  0.52429826   0.4747729
## GarageArea       0.4897817  0.4056562   0.33782212  0.47895382   0.6028032
## TotalBsmtSF      0.8195300  0.3237224   0.28557256  0.39145200   0.3459364
## X1stFlrSF        1.0000000  0.3806375   0.40951598  0.28198586   0.2657773
## FullBath         0.3806375  1.0000000   0.55478425  0.46827079   0.4638112
## TotRmsAbvGrd     0.4095160  0.5547843   1.00000000  0.09558913   0.1376660
## YearBuilt        0.2819859  0.4682708   0.09558913  1.00000000   0.8451407
## GarageYrBlt      0.2657773  0.4638112   0.13766599  0.84514067   1.0000000
## YearRemodAdd     0.2403793  0.4390465   0.19173982  0.59285498   0.6043527
## house_age       -0.2821427 -0.4684029  -0.09695522 -0.99903594  -0.8442543
##                 YearRemodAdd   house_age
## SalePrice          0.5071010 -0.52335042
## total_quality      0.6028415 -0.64556338
## OverallQual        0.5506839 -0.57262947
## total_sf           0.3384040 -0.33633766
## total_rooms        0.3839960 -0.38247836
## GrLivArea          0.2873885 -0.20030250
## ExterQual_num      0.5873184 -0.59752232
## KitchenQual_num    0.6253161 -0.52924741
## BsmtQual_num       0.5699785 -0.70257872
## GarageCars         0.4206222 -0.53872739
## total_baths        0.4437190 -0.52259496
## GarageArea         0.3715998 -0.47940773
## TotalBsmtSF        0.2910656 -0.39149776
## X1stFlrSF          0.2403793 -0.28214268
## FullBath           0.4390465 -0.46840292
## TotRmsAbvGrd       0.1917398 -0.09695522
## YearBuilt          0.5928550 -0.99903594
## GarageYrBlt        0.6043527 -0.84425426
## YearRemodAdd       1.0000000 -0.59035972
## house_age         -0.5903597  1.00000000
corre <- colnames(corr_num)

No difference in year sold

all_nums %>% 
  gather(-SalePrice,-Id, key="var", value="value") %>% 
  ggplot(aes(x=value, y=SalePrice)) + 
  geom_point() +
  scale_y_continuous(labels=dollar) + 
  facet_wrap_paginate(~ var, scales = "free", nrow=4, ncol=4, page=4) 

  # geom_text(mapping = aes(label=Id))

Bivariate plots show a number of numeric variables which could benefit from being logged

# loop through all numeric vars and create bivariate plot agsisnt sale price
for(i in 1:ncol(all_nums)) {
  if(is.numeric(all_nums[,i])) {
    plot(all_nums[,i],all_nums$SalePrice, main=names(all_nums)[i])
  }
}

Getting to know building type and neighborhood a bit better…

train_v0 <- combined %>% 
  filter(!is.na(SalePrice))

ggplot(train_v0, aes(BldgType, log(SalePrice))) + 
  geom_boxplot()+
  scale_y_continuous(labels=dollar) + 
  ggtitle("BldgType ~ log of SalePrice")

train_v0 %>% 
  filter(BldgType != "1Fam", BldgType != "TwnhsE") %>% 
  group_by(Neighborhood) %>% 
  ggplot(aes(Neighborhood, SalePrice, color=BldgType)) +
  geom_jitter() + 
  scale_y_continuous(labels=dollar) + 
  coord_flip() 

train_v0 %>% 
  filter(BldgType == "1Fam") %>% 
  group_by(Neighborhood) %>% 
  ggplot(aes(Neighborhood, SalePrice, color=OverallQual)) +
  geom_jitter() + 
  scale_y_continuous(labels=dollar) + 
  coord_flip()

train_v0 %>% 
  filter(BldgType == "1Fam") %>% 
  group_by(Neighborhood) %>% 
  ggplot(aes(Neighborhood, SalePrice, color=HouseStyle)) +
  geom_jitter() + 
  scale_y_continuous(labels=dollar) + 
  coord_flip()

ggplot(train_v0, aes(Neighborhood, total_sf, color=log(SalePrice))) + 
  geom_boxplot()+
  scale_y_continuous(labels=dollar)

ggplot(train_v0, aes(total_sf)) + 
    geom_histogram(color='white', alpha=0.6, fill='steelblue') +
    labs(title='Distribution of total SF', y='')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(train_v0, aes(total_quality)) + 
    geom_histogram(color='white', alpha=0.6, fill='steelblue') +
    labs(title='Distribution of quality score', y='')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

train_v0 %>% 
  ggplot(aes(total_quality, SalePrice, color=Neighborhood)) +
  geom_jitter() + 
  scale_y_continuous(labels=dollar) + 
  ggtitle("Total quality 1-5 ~ Sale Price")

ggplot(train_v0, aes(total_baths)) + 
    geom_histogram(color='white', alpha=0.6, fill='steelblue') +
    labs(title='Distribution of total baths', y='')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Statistical Model

I removed 4 outliers which had a low sale price with a very high square footage, as square footage was one of the more strongly correlated variables.

I choose to use all of the variables available, since removing ones which appeared to have collinearity increased out of sample RMSE. I took the log of a few input variables which demonstrated a nonlinear relationship in a bivariate plot against the sale price: above grade square footage, the area of the lot, linear feet of street connected to the property, the area of masonry veneer, the garage area, the total square footage of the basement, and the square footage of wooden decks.

I one hot encoded all variables into a dummy matrix and removed near zero variance variables from the training set. In addition to all of the predictors mentioned, I one hot encoded three interactions: total square footage by neighborhood, total rooms by neighborhood, and total quality by type of house.
Additionally, I centered and scaled all variables, as is required for the glmnet regression technique I choose.

Regression Model

In total, I trained on 1458 samples with 165 predictors. Comparing out of sample RMSE, a log model using glmnet regression performed the best at producing a model estimating Sale Price. The tuned model uses an alpha value of .383 and a lambda value of .008 when setting the seed to 123.

# create dummy matrix with interactions and logged vars 
numeric_l4 <- combined  
dv_l4 <- dummyVars( ~ total_sf * Neighborhood + total_rooms * Neighborhood + total_quality * MSSubClass   + 
                      log(GrLivArea + 1) +
                      log(LotArea + 1) +
                      log(LotFrontage + 1) +
                      log(MasVnrArea + 1) +
                      log(TotalBsmtSF + 1) +
                      log(WoodDeckSF + 1) +
                      ., numeric_l4, fullRank=TRUE)
numeric_l4 <- data.frame(predict(dv_l4, newdata=numeric_l4))

# filter outliers and separate test and train set
extended_l4 <- numeric_l4 %>%
  filter(
    !is.na(SalePrice), 
    Id != c(186, 497, 1299, 524), # Cooks distance over 1
    ) %>%
  dplyr::select(-Id)
## Warning in Id != c(186, 497, 1299, 524): longer object length is not a
## multiple of shorter object length
train_l4 <- extended_l4 [, -nzv(extended_l4)]

test_l4 <- numeric_l4 %>%
  filter(is.na(SalePrice))

# train model
set.seed(123)
model_l4 <- train(
  log(SalePrice) ~ ., 
  data=train_l4, 
  preProcess=c("center", "scale"), 
  method="glmnet",
    tuneGrid = expand.grid(
    alpha = .23, #seq(.18, .23, .001),
    lambda = .012 #seq(.01, .02, .001)
              )
)


# find in-sample statistics
trainRSME_l4 <- train_l4 %>%
  dplyr::select(SalePrice) %>%
  mutate(
    Est_SalePrice = exp(predict(model_l4, newdata = train_l4)),
    priceDiff = SalePrice - Est_SalePrice,
  )

trainRSME_l4 <- trainRSME_l4 %>%
  summarize(
    MAE = MAE(SalePrice, Est_SalePrice),
    RMSE = RMSE(SalePrice, Est_SalePrice),
    R2 = R2(SalePrice, Est_SalePrice)
  )

trainRSME_l4
##        MAE     RMSE        R2
## 1 13124.37 19655.68 0.9399586
# see model results
plot(model_l4$finalModel)

(model_l4$results)
##   alpha lambda      RMSE  Rsquared        MAE      RMSESD  RsquaredSD
## 1  0.23  0.012 0.1167199 0.9148447 0.08193377 0.007493348 0.009355736
##         MAESD
## 1 0.003393271
coef(model_l4, s="lambda.min")
## NULL
# plot(model_l4)
model_l4
## glmnet 
## 
## 1458 samples
##  164 predictor
## 
## Pre-processing: centered (164), scaled (164) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1458, 1458, 1458, 1458, 1458, 1458, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE       
##   0.1167199  0.9148447  0.08193377
## 
## Tuning parameter 'alpha' was held constant at a value of 0.23
## 
## Tuning parameter 'lambda' was held constant at a value of 0.012

Create submission file

submit_l4 <- test_l4 %>%
  dplyr::select(Id) %>%
  mutate(SalePrice = exp(predict(model_l4, newdata = test_l4)))
write_csv(submit_l4, 'submission.csv')

Variables Importance

The final model favored quality and square footage related variables.

# make a ordered graph, filtered to the highest coefficients
compare <- data.frame(variables = c("(Intercept)",model_l4$coefnames ))
compare$coefs <- c(as.numeric(as.character(coef(model_l4$finalModel, model_l4$bestTune$lambda))))
compare <- compare %>% filter(variables!="(Intercept)")
# Plot variables with highest coefficients
compare %>% 
  filter(abs(coefs)>.005) %>% 
  ggplot(aes(reorder(variables, coefs), coefs)) +
  geom_bar(stat = "identity",position="dodge") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title="Variables with Coefficients over .005", x="Variable", y="Coefficient Value")

```